Abstract

The performance of stochastic gradient de- 
scent (SGD) depends critically on how learn- 
ing rates are tuned and decreased over time. 
We propose a method to automatically adjust 
multiple learning rates so as to minimize the 
expected error at any one time. The method 
relies on local gradient variations across sam- 
ples. In our approach, learning rates can in- 
crease as well as decrease, making it suitable 
for non-stationary problems. Using a num- 
ber of convex and non-convex learning tasks, 
we show that the resulting algorithm matches 
the performance of SGD or other adaptive 
approaches with their best settings obtained 
through systematic search, and effectively re- 
moves the need for learning rate tuning. 

1. Introduction 

Large-scale learning problems require algorithms that scale
benignly (e.g. sub-linearly) with the size of the dataset and the
number of trainable parameters. This has led to a recent resurgence
of interest in stochastic gradient descent (SGD) methods. Besides
fast convergence, SGD has sometimes been observed to yield
significantly better generalization errors than batch methods
(Bottou & Bousquet, 2011).

In practice, getting good performance with SGD re- 
quires some manual adjustment of the initial value of 
the learning rate (or step size) for each model and each 
problem, as well as the design of an annealing schedule 
for stationary data. The problem is particularly acute 
for non-stationary data. 

The contribution of this paper is a novel method to automatically
adjust learning rates (possibly different learning rates for
different parameters), so as to minimize some estimate of the
expectation of the loss at any one time.

Starting from an idealized scenario where every sample's
contribution to the loss is quadratic and separable, we derive a
formula for the optimal learning rates for SGD, based on estimates
of the variance of the gradient. The formula has two components: one
that captures variability across samples, and one that captures the
local curvature, both of which can be estimated in practice. The
method can be used to derive a single common learning rate, or local
learning rates for each parameter, or for each block of parameters,
leading to three variations of the basic algorithm, none of which
needs any parameter tuning.

The performance of the methods, obtained without any manual tuning,
is reported on a variety of convex and non-convex learning models
and tasks. They compare favorably with an "ideal SGD", where the
best possible learning rate was obtained through systematic search,
as well as with previous adaptive schemes.

2. Background 

SGD methods have a long history in adaptive signal processing,
neural networks, and machine learning, with an extensive literature
(see (Bottou, 1998; Bottou & Bousquet, 2011) for recent reviews).
While the practical advantages of SGD for machine learning
applications have been known for a long time (LeCun et al., 1998),
interest in SGD has increased in recent years due to the
ever-increasing amounts of streaming data, to theoretical optimality
results for generalization error (Bottou & LeCun, 2004), and to
competitions being won by SGD methods, such as the PASCAL Large
Scale Learning Challenge (Bordes et al., 2009), where a Quasi-Newton
approximation of the Hessian was used within SGD. Still,
practitioners need to deal with a sensitive hyper-parameter tuning
phase to get top performance: each of the PASCAL tasks used very
different parameter settings. This tuning is very costly, as every
parameter setting is typically tested over multiple epochs.

Learning rates in SGD are generally decreased according to a
schedule of the form $\eta(t) = \eta_0 (1 + \gamma t)^{-1}$.
Originally proposed as $\eta(t) = O(t^{-1})$ in (Robbins & Monro,
1951), this form was recently analyzed in (Xu, 2011; Bach &
Moulines, 2011) from a non-asymptotic perspective to understand how
hyper-parameters like $\eta_0$ and $\gamma$ affect the convergence
speed.
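As a minimal sketch of the annealing schedule above, the function below evaluates $\eta(t) = \eta_0 (1 + \gamma t)^{-1}$. The name `annealed_rate` and the numeric values of `eta0` and `gamma` are illustrative assumptions, not settings taken from the experiments.

```python
def annealed_rate(t: int, eta0: float, gamma: float) -> float:
    """Learning rate eta(t) = eta0 / (1 + gamma * t)."""
    return eta0 / (1.0 + gamma * t)


# Illustrative values only: eta0 and gamma are the two knobs that
# normally have to be tuned per model and per problem.
rates = [annealed_rate(t, eta0=0.1, gamma=0.01) for t in (0, 100, 10_000)]
print(rates)
```

The rate starts at `eta0` and decays like $1/t$ once $\gamma t \gg 1$, so both the initial value and the decay speed have to be chosen by hand.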

Numerous researchers have proposed schemes for making learning rates
adaptive, either globally or by adapting one rate per parameter
('diagonal preconditioning'); see (George & Powell, 2006) for an
overview. An early diagonal preconditioning scheme was proposed in
(Almeida & Langlois, 1999), where the learning rate is adapted as

$$\eta_i(t) = \max\left(0, \frac{\eta_0 \, \nabla_{\theta_i}^{(t)} \cdot \nabla_{\theta_i}^{(t-1)}}{\overline{v_i}}\right)$$

for each problem dimension $i$, where $\nabla_{\theta_i}^{(t)}$ is the
gradient of the $i$th parameter at iteration $t$, and
$\overline{v_i} \approx E[\nabla_{\theta_i}^2]$
is a recent running average of its square. Stochastic
meta-descent (SMD, Schraudolph (1999; 2002)) uses a related
multiplicative update of learning rates. Approaches based on the
natural gradient (Amari et al., 2000) precondition the updates by
the empirical Fisher information matrix (estimated by the gradient
covariance matrix, or its diagonal approximation), in the simplest
case: $\eta_i = \eta_0 / \overline{v_i}$; the "Natural Newton"
algorithm (Le Roux & Fitzgibbon, 2010) combines the gradient
covariance with second-order information. Finally, derived from a
worst-case analysis, (Duchi et al., 2010) propose an approach called
'AdaGrad', where the learning rate takes the form

$$\eta_i(t) = \frac{\eta_0}{\sqrt{\sum_{s=0}^{t} \left(\nabla_{\theta_i}^{(s)}\right)^2}}$$

The main practical drawback of all of these approaches is that they
retain one or more sensitive hyper-parameters, which must be tuned
to obtain satisfactory performance. AdaGrad has another
disadvantage: because it accumulates all the gradients from the
moment training starts to determine the current learning rate, the
learning rate decreases monotonically. This is especially
problematic for non-stationary problems, but also for stationary
ones, as the properties of the optimization landscape change
continuously while navigating it.
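The monotone decrease can be seen in a small sketch. It assumes the standard AdaGrad rule, in which the rate of one parameter is $\eta_0$ divided by the square root of the accumulated squared gradients; the function name and the gradient sequence are hypothetical.

```python
import math


def adagrad_rates(grads, eta0=1.0):
    """Per-step AdaGrad rates eta0 / sqrt(sum of squared gradients so far).

    Assumes the first gradient is non-zero, otherwise the first rate
    would divide by zero.
    """
    total, rates = 0.0, []
    for g in grads:
        total += g * g
        rates.append(eta0 / math.sqrt(total))
    return rates


# The gradient magnitude grows again at the end (as it would after a change
# in the data), yet the rate can only keep shrinking.
rates = adagrad_rates([1.0, 0.5, 0.5, 2.0, 2.0])
print(rates)
```

Because the accumulator never forgets, a later increase in gradient magnitude makes the rate smaller still, which is the opposite of what a non-stationary problem calls for.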

The main contribution of the present paper is a formula that gives
the value of the learning rate that will maximally decrease the
expected loss after the next update. For efficiency reasons, some
terms in the formula must be approximated using such quantities as
the mean and variance of the gradient. As a result, the learning
rate is automatically decreased to zero when approaching an optimum
of the loss, without requiring a pre-determined annealing schedule,
and if the problem is non-stationary, the learning rate grows again
when the data changes.

Figure 1. Illustration of the idealized loss function considered
(thick magenta), which is the average of the quadratic contributions
of each sample (dotted blue), with minima distributed around the
point $\theta^*$. Note that the curvatures are assumed to be
identical for all samples.

3. Optimal Adaptive Learning Rates 

In this section, we derive an optimal learning rate 
schedule, using an idealized quadratic and separable 
loss function. We show that using this learning rate 
schedule preserves convergence guarantees of SGD. In 
the following section, we find how the optimal learning 
rate values can be estimated from available informa- 
tion, and describe a couple of possible approximations. 

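The samples, indexed by $j$, are drawn i.i.d. from a data
distribution $\mathcal{P}$. Each sample contributes a per-sample
loss $\mathcal{L}^{(j)}(\theta)$ to the expected loss:

$$J(\theta) = E_{j \sim \mathcal{P}}\left[\mathcal{L}^{(j)}(\theta)\right] \qquad (1)$$

where $\theta \in \mathbb{R}^d$ is the trainable parameter vector,
whose optimal value is denoted $\theta^* = \arg\min_\theta J(\theta)$.
The SGD parameter update formula is of the form
$\theta^{(t+1)} = \theta^{(t)} - \eta^{(t)} \nabla_\theta^{(j)}$, where
$\nabla_\theta^{(j)} = \frac{\partial}{\partial\theta} \mathcal{L}^{(j)}(\theta)$
is the gradient of the contribution of example $j$ to the loss, and
the learning rate $\eta^{(t)}$ is a suitably chosen sequence of
positive scalars (or positive definite matrices).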

3.1. Noisy Quadratic Loss 

We assume that the per-sample loss functions are smooth around
minima, and can be locally approximated by a quadratic function. We
also assume that the minimum value of the per-sample loss functions
is zero:

$$\mathcal{L}^{(j)}(\theta) = \frac{1}{2} \left(\theta - c^{(j)}\right)^\top H^{(j)} \left(\theta - c^{(j)}\right)$$
$$\nabla_\theta^{(j)} = H^{(j)} \left(\theta - c^{(j)}\right)$$

where $H^{(j)}$ is the (positive semi-definite) Hessian matrix of the
per-sample loss of sample $j$, and $c^{(j)}$ is the optimum for that
sample. The distribution of per-sample optima $c^{(j)}$ has mean
$\theta^*$ and variance $\Sigma$. Figure 1 illustrates the scenario
in one dimension.

To simplify the analysis, we assume for the remainder of this
section that the Hessians of the per-sample losses are identical for
all samples, and that the problem is separable, i.e., the Hessians
are diagonal, with diagonal terms denoted
$\{h_1, \ldots, h_i, \ldots, h_d\}$. Further, we will ignore the
off-diagonal terms of $\Sigma$, and denote the diagonal
$\{\sigma_1^2, \ldots, \sigma_i^2, \ldots, \sigma_d^2\}$. Then, for
any of the $d$ dimensions, we obtain a one-dimensional problem (all
indices $i$ omitted).



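$$J(\theta) = E_{j \sim \mathcal{P}}\left[\frac{1}{2} h \left(\theta - c^{(j)}\right)^2\right] = \frac{1}{2} h \left[(\theta - \theta^*)^2 + \sigma^2\right] \qquad (2)$$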



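The gradient components are $\nabla_\theta^{(j)} = h(\theta - c^{(j)})$, with

$$E[\nabla_\theta] = h(\theta - \theta^*), \qquad \mathrm{Var}[\nabla_\theta] = h^2 \sigma^2 \qquad (3)$$

and we can rewrite the SGD update equation as

$$\theta^{(t+1)} = \theta^{(t)} - \eta h \left(\theta^{(t)} - c^{(j)}\right) = (1 - \eta h)\,\theta^{(t)} + \eta h\, \theta^* + \eta h \sigma \xi^{(j)} \qquad (4)$$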

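where the $\xi^{(j)}$ are i.i.d. samples from a zero-mean and
unit-variance Gaussian distribution. Inserting this into equation 2,
we obtain the expected loss after an SGD update

$$E\left[J(\theta^{(t+1)}) \mid \theta^{(t)}\right] = \frac{1}{2} h \cdot \left[(1 - \eta h)^2 (\theta^{(t)} - \theta^*)^2 + \eta^2 h^2 \sigma^2 + \sigma^2\right]$$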
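The expected loss after one update can be checked numerically. The sketch below compares the closed form $\frac{1}{2} h [(1 - \eta h)^2 (\theta - \theta^*)^2 + \eta^2 h^2 \sigma^2 + \sigma^2]$ with a Monte Carlo average over simulated SGD steps in the one-dimensional model. Function names, the seed and all numeric values are assumptions chosen for illustration.

```python
import random


def expected_loss_after_update(theta, theta_star, h, sigma, eta):
    """Closed-form expected loss after one SGD step in the 1-D quadratic model."""
    d2 = (theta - theta_star) ** 2
    return 0.5 * h * ((1 - eta * h) ** 2 * d2
                      + eta ** 2 * h ** 2 * sigma ** 2
                      + sigma ** 2)


def simulated_loss_after_update(theta, theta_star, h, sigma, eta,
                                n=200_000, seed=0):
    """Monte Carlo estimate of the same quantity."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        c = rng.gauss(theta_star, sigma)           # per-sample optimum c^(j)
        new_theta = theta - eta * h * (theta - c)  # one SGD step
        # J(new_theta) from the one-dimensional expected loss
        total += 0.5 * h * ((new_theta - theta_star) ** 2 + sigma ** 2)
    return total / n


closed = expected_loss_after_update(1.0, 0.0, 2.0, 0.5, 0.2)
simulated = simulated_loss_after_update(1.0, 0.0, 2.0, 0.5, 0.2)
print(closed, simulated)
```

With these values the closed form evaluates to 0.65, and the simulated average should agree up to sampling noise.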



3.2. Optimal Adaptive Learning Rate 

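We can now derive the optimal (greedy) learning rate for the current
time $t$ as the value $\eta^*(t)$ that minimizes the expected loss
after the next update

$$\begin{aligned}
\eta^*(t) &= \arg\min_\eta \left[(1 - \eta h)^2 (\theta^{(t)} - \theta^*)^2 + \sigma^2 + \eta^2 h^2 \sigma^2\right] \\
&= \arg\min_\eta \left[\eta^2 \left(h (\theta^{(t)} - \theta^*)^2 + h \sigma^2\right) - 2 \eta\, (\theta^{(t)} - \theta^*)^2\right] \\
&= \frac{1}{h} \cdot \frac{(\theta^{(t)} - \theta^*)^2}{(\theta^{(t)} - \theta^*)^2 + \sigma^2} \qquad (5)
\end{aligned}$$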



In the classical (noiseless or batch) derivation of the optimal
learning rate, the best value is simply $\eta^*(t) = h^{-1}$. The
above formula inserts a corrective term that reduces the learning
rate whenever the samples pull the parameter vector in different
directions, as measured by the gradient variance $\sigma^2$. The
reduction of the learning rate is larger near an optimum, when
$(\theta^{(t)} - \theta^*)^2$ is small relative to $\sigma^2$. In
effect, this will reduce the expected error due to the noise in the
gradient. Overall, this will have the same effect as the usual
method of progressively decreasing the learning rate as we get
closer to the optimum, but it makes this annealing schedule
automatic.
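A short sketch of the greedy rate $\eta^* = \frac{1}{h} \cdot \frac{(\theta - \theta^*)^2}{(\theta - \theta^*)^2 + \sigma^2}$, together with a brute-force check that it minimizes the expected loss after one update. The function names and test values are illustrative assumptions; the functions require $(\theta - \theta^*)^2 + \sigma^2 > 0$.

```python
def optimal_rate(theta, theta_star, h, sigma):
    """Greedy optimal rate (1/h) * d2 / (d2 + sigma^2), d2 = (theta - theta*)^2."""
    d2 = (theta - theta_star) ** 2
    return d2 / (h * (d2 + sigma ** 2))


def expected_loss(theta, theta_star, h, sigma, eta):
    """Expected loss after one SGD step with rate eta (1-D quadratic model)."""
    d2 = (theta - theta_star) ** 2
    return 0.5 * h * ((1 - eta * h) ** 2 * d2
                      + eta ** 2 * h ** 2 * sigma ** 2
                      + sigma ** 2)


# Brute-force search over a grid of candidate rates.
etas = [k / 1000 for k in range(1001)]
best = min(etas, key=lambda eta: expected_loss(1.0, 0.0, 2.0, 0.5, eta))

print(best, optimal_rate(1.0, 0.0, 2.0, 0.5))   # grid minimum vs. formula
print(optimal_rate(1.0, 0.0, 2.0, 0.0))          # noiseless case gives 1/h
print(optimal_rate(0.01, 0.0, 2.0, 0.5))         # near the optimum the rate shrinks
```

The last two calls illustrate the two regimes discussed in the text: without noise the rate is the Newton step $1/h$, and close to the optimum the noise term dominates and the rate tends to zero.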

If we do gradient descent with $\eta^*(t)$, then almost surely, the
algorithm converges (for the quadratic model). The proof is given in
the appendix.

3.3. Global vs. Parameter-specific Rates 

The previous subsections looked at the optimal learning rate in the
one-dimensional case, which can be trivially generalized to $d$
dimensions if we assume that all parameters are separable, namely by
using an individual learning rate $\eta_i^*$ for each dimension $i$.
Alternatively, we can derive an optimal global learning rate
$\eta_g^*$ (see appendix for the full derivation),

$$\eta_g^*(t) = \frac{\sum_{i=1}^d h_i^2 (\theta_i^{(t)} - \theta_i^*)^2}{\sum_{i=1}^d \left(h_i^3 (\theta_i^{(t)} - \theta_i^*)^2 + h_i^3 \sigma_i^2\right)} \qquad (6)$$

which is especially useful if the problem is badly conditioned.
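The global rate can be sketched as follows, assuming the form obtained by setting the derivative of the summed per-dimension expected loss to zero: $\eta_g^* = \sum_i h_i^2 (\theta_i - \theta_i^*)^2 / \sum_i h_i^3 [(\theta_i - \theta_i^*)^2 + \sigma_i^2]$. The function name and inputs are hypothetical.

```python
def global_optimal_rate(theta, theta_star, h, sigma):
    """Single rate shared by all dimensions of a separable quadratic problem."""
    num = sum(hi ** 2 * (t - ts) ** 2
              for hi, t, ts in zip(h, theta, theta_star))
    den = sum(hi ** 3 * ((t - ts) ** 2 + si ** 2)
              for hi, t, ts, si in zip(h, theta, theta_star, sigma))
    return num / den


# In one dimension this reduces to the element-wise rate of Section 3.2.
print(global_optimal_rate([1.0], [0.0], [2.0], [0.5]))

# Badly conditioned, noiseless two-dimensional example (curvatures 1 and 10):
# the shared rate stays close to 1 / max(h).
rate_2d = global_optimal_rate([1.0, 1.0], [0.0, 0.0], [1.0, 10.0], [0.0, 0.0])
print(rate_2d)
```

In the second example the high-curvature direction dominates both sums, so the shared rate is pulled toward the inverse of the largest curvature.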

In-between a global and a component-wise learning 
rate, it is of course possible to have common learning 
rates for blocks of parameters. In the case of multi- 
layer learning systems, the blocks may regroup the pa- 
rameters of each single layer, the biases, etc. This is 
particularly useful in deep learning, where the gradi- 
ent magnitudes can vary significantly between shallow 
and deep layers. 

4. Approximations 



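In practice, we are not given the quantities $\sigma_i$, $h_i$ and
$(\theta_i^{(t)} - \theta_i^*)^2$. However, based on equation 3, we
can estimate them from the observed samples of the gradient:

$$\eta_i^* = \frac{1}{h_i} \cdot \frac{(E[\nabla_{\theta_i}])^2}{(E[\nabla_{\theta_i}])^2 + \mathrm{Var}[\nabla_{\theta_i}]} = \frac{1}{h_i} \cdot \frac{(E[\nabla_{\theta_i}])^2}{E[\nabla_{\theta_i}^2]} \qquad (7)$$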



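The situation is slightly different for the global learning rate
$\eta_g^*$. Here we assume that it is feasible to estimate the
maximal curvature $h^+ = \max_i(h_i)$ (which can be done efficiently,
for example using the diagonal Hessian computation method described
in (LeCun et al., 1998)). Then we have the bound

$$\eta_g^*(t) \ge \frac{1}{h^+} \cdot \frac{\sum_{i=1}^d h_i^2 (\theta_i^{(t)} - \theta_i^*)^2}{\sum_{i=1}^d \left(h_i^2 (\theta_i^{(t)} - \theta_i^*)^2 + h_i^2 \sigma_i^2\right)} = \frac{1}{h^+} \cdot \frac{\left\|E[\nabla_\theta]\right\|^2}{E\left[\|\nabla_\theta\|^2\right]} \qquad (8)$$

because

$$E\left[\|\nabla_\theta\|^2\right] = E\left[\sum_{i=1}^d (\nabla_{\theta_i})^2\right] = \sum_{i=1}^d E\left[(\nabla_{\theta_i})^2\right]$$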

In both cases (equations 7 and 8), the optimal learning 
rate is decomposed into two factors, one term which is 
the inverse curvature (as is the case for batch second- 
order methods), and one novel term that depends on 
the noise in the gradient, relative to the expected 
squared norm of the gradient. Below, we approxi- 
mate these terms separately. For the investigations 
below, when we use the true values instead of a prac- 
tical algorithm, we speak of the 'oracle' variant (e.g. 
in Figure 3). 
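Both rates can be estimated from gradient samples alone, given a curvature estimate. The sketch below uses plain batch averages in place of expectations: the element-wise rate is $(E[\nabla])^2 / (h \cdot E[\nabla^2])$ and the global bound is $\|E[\nabla]\|^2 / (h^+ \cdot E[\|\nabla\|^2])$. Function names and inputs are hypothetical, and the curvatures are simply passed in.

```python
def elementwise_rate(grad_samples, h):
    """(1/h) * (mean gradient)^2 / (mean squared gradient) for one component."""
    n = len(grad_samples)
    mean = sum(grad_samples) / n
    second_moment = sum(g * g for g in grad_samples) / n
    return mean * mean / (h * second_moment)


def global_rate_bound(grad_vectors, h_max):
    """(1/h+) * ||mean gradient||^2 / mean(||gradient||^2)."""
    n, d = len(grad_vectors), len(grad_vectors[0])
    mean = [sum(g[i] for g in grad_vectors) / n for i in range(d)]
    mean_sq_norm = sum(sum(x * x for x in g) for g in grad_vectors) / n
    return sum(m * m for m in mean) / (h_max * mean_sq_norm)


print(elementwise_rate([1.0, 1.0, 1.0], h=2.0))   # samples agree: rate is 1/h
print(elementwise_rate([1.0, -1.0], h=2.0))       # samples cancel: rate is 0
print(global_rate_bound([[1.0, 0.0], [1.0, 0.0]], h_max=4.0))
```

When all samples agree, the noise factor equals one and the rate is the inverse curvature; when they cancel, the rate drops to zero.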

4.1. Approximate Variability 

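We use an exponential moving average with time-constant $\tau$ (the
approximate number of samples considered from recent memory) for
online estimates of the quantities in equations 7 and 8:

$$\overline{g_i}(t+1) = (1 - \tau_i^{-1}) \cdot \overline{g_i}(t) + \tau_i^{-1} \cdot \nabla_{\theta_i}^{(t)}$$
$$\overline{v_i}(t+1) = (1 - \tau_i^{-1}) \cdot \overline{v_i}(t) + \tau_i^{-1} \cdot \left(\nabla_{\theta_i}^{(t)}\right)^2$$
$$\overline{l}(t+1) = (1 - \tau^{-1}) \cdot \overline{l}(t) + \tau^{-1} \cdot \left\|\nabla_\theta\right\|^2$$

where $\overline{g_i}$ estimates the average gradient component $i$,
$\overline{v_i}$ estimates the uncentered variance on gradient
component $i$, and $\overline{l}$ estimates the squared length of the
gradient vector:

$$\overline{g_i} \approx E[\nabla_{\theta_i}], \qquad \overline{v_i} \approx E[\nabla_{\theta_i}^2], \qquad \overline{l} \approx E\left[\|\nabla_\theta\|^2\right]$$

and we need $\overline{v_i}$ only for an element-wise adaptive
learning rate and $\overline{l}$ only in the global case.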
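The moving-average updates for one gradient component can be sketched as below. The function name `update_averages` is hypothetical, and the time-constant is held fixed here purely to show the averaging behavior.

```python
def update_averages(g_bar, v_bar, grad, tau):
    """One exponential-moving-average step with time-constant tau >= 1."""
    inv = 1.0 / tau
    g_bar = (1.0 - inv) * g_bar + inv * grad          # running mean of the gradient
    v_bar = (1.0 - inv) * v_bar + inv * grad * grad   # running mean of its square
    return g_bar, v_bar


# Feeding a constant gradient drives the estimates to its mean and square.
g_bar, v_bar = 0.0, 0.0
for _ in range(200):
    g_bar, v_bar = update_averages(g_bar, v_bar, grad=2.0, tau=10.0)
print(g_bar, v_bar)
```

A larger `tau` gives smoother but slower estimates, which is why the time-constant itself is made adaptive in the next subsection.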

4.2. Adaptive Time-constant 

We want the size of the memory to increase when the steps taken are
small (increment by 1), and to decay quickly if a large step (close
to the Newton step) is taken, which is obtained naturally by the
following update

$$\tau_i(t+1) = \left(1 - \frac{\overline{g_i}(t)^2}{\overline{v_i}(t)}\right) \cdot \tau_i(t) + 1$$

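Algorithm 1: Stochastic gradient descent with
adaptive learning rates (element-wise, vSGD-1).

repeat
    draw a sample $c^{(j)}$, compute the gradient $\nabla_\theta^{(j)}$,
    and compute the diagonal Hessian estimates $h_i^{(j)}$ using the
    "bbprop" method
    for $i \in \{1, \ldots, d\}$ do
        update moving averages
            $\overline{g_i} \leftarrow (1 - \tau_i^{-1}) \cdot \overline{g_i} + \tau_i^{-1} \cdot \nabla_{\theta_i}^{(j)}$
            $\overline{v_i} \leftarrow (1 - \tau_i^{-1}) \cdot \overline{v_i} + \tau_i^{-1} \cdot (\nabla_{\theta_i}^{(j)})^2$
            $\overline{h_i} \leftarrow (1 - \tau_i^{-1}) \cdot \overline{h_i} + \tau_i^{-1} \cdot |\mathrm{bbprop}(\theta)_i^{(j)}|$
        estimate learning rate $\eta_i^* \leftarrow (\overline{g_i})^2 / (\overline{h_i} \cdot \overline{v_i})$
        update memory size $\tau_i \leftarrow (1 - (\overline{g_i})^2 / \overline{v_i}) \cdot \tau_i + 1$
        update parameter $\theta_i \leftarrow \theta_i - \eta_i^* \nabla_{\theta_i}^{(j)}$
    end
until stopping criterion is met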
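The loop of Algorithm 1 can be sketched in runnable form on the one-dimensional noisy quadratic of Section 3.1. This is a simplified illustration under stated assumptions, not the implementation used in the experiments: the true curvature `h` stands in for the bbprop estimate, the slow-start factor of Section 4.4 is omitted, the memory size is initialized to the number of warm-up samples, and all names and default values are hypothetical.

```python
import random


def vsgd_1d(theta0, theta_star=0.0, h=1.0, sigma=1.0,
            steps=2000, n_init=10, seed=0):
    """Element-wise adaptive-rate SGD on a 1-D noisy quadratic loss."""
    rng = random.Random(seed)

    def sample_grad(theta):
        c = rng.gauss(theta_star, sigma)   # per-sample optimum c^(j)
        return h * (theta - c)             # per-sample gradient

    # Warm-up: plain arithmetic averages over a handful of samples.
    init = [sample_grad(theta0) for _ in range(n_init)]
    g_bar = sum(init) / n_init
    v_bar = sum(g * g for g in init) / n_init
    h_bar = abs(h)
    tau = float(n_init)                    # assumed initial memory size

    theta, rates = theta0, []
    for _ in range(steps):
        grad = sample_grad(theta)
        inv = 1.0 / tau
        g_bar = (1.0 - inv) * g_bar + inv * grad
        v_bar = (1.0 - inv) * v_bar + inv * grad * grad
        h_bar = (1.0 - inv) * h_bar + inv * abs(h)   # stand-in for |bbprop|
        ratio = g_bar * g_bar / v_bar                # lies in [0, 1]
        eta = ratio / h_bar                          # estimated learning rate
        tau = (1.0 - ratio) * tau + 1.0              # adaptive memory size
        theta -= eta * grad
        rates.append(eta)
    return theta, rates


theta_final, rates = vsgd_1d(theta0=10.0)
print(theta_final, rates[0], rates[-1])
```

Since the mean and the second moment are averaged with the same weights, the ratio of squared mean to second moment stays between 0 and 1, so the rate never exceeds the Newton step $1/h$ and the memory size never drops below 1.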



This way of making the memory size adaptive allows 
us to eliminate one otherwise tuning-sensitive hyper- 
parameter. Note that these updates (correctly) do 
not depend on the local curvature, making them scale- 
invariant. 
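The properties of the memory-size update can be checked directly. The sketch assumes the update $\tau \leftarrow (1 - \overline{g}^2 / \overline{v}) \cdot \tau + 1$; the function name and the numbers are illustrative.

```python
def update_memory(tau, g_bar, v_bar):
    """Adaptive time-constant: tau <- (1 - g_bar^2 / v_bar) * tau + 1."""
    return (1.0 - g_bar * g_bar / v_bar) * tau + 1.0


print(update_memory(100.0, 0.0, 1.0))   # mean gradient ~ 0: memory grows by 1
print(update_memory(100.0, 2.0, 4.0))   # g_bar^2 == v_bar: memory resets to 1

# Rescaling the gradients by a factor 3 (g_bar by 3, v_bar by 9) changes nothing.
print(update_memory(50.0, 1.0, 2.0), update_memory(50.0, 3.0, 18.0))
```

The last line illustrates the scale-invariance noted in the text: only the ratio of the squared mean to the second moment enters the update.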

4.3. Approximate Curvature 

There exist a number of methods for obtaining online estimates of
the diagonal Hessian (Martens et al., 2012; Bordes et al., 2009;
Chapelle & Erhan, 2011). We adopt the "bbprop" method, which
computes positive estimates of the diagonal Hessian terms
(Gauss-Newton approximation) for a single sample, $h_i^{(j)}$, using
a back-propagation formula (LeCun et al., 1998). The diagonal
estimates are used in an exponential moving average procedure

$$\overline{h_i}(t+1) = (1 - \tau_i^{-1}) \cdot \overline{h_i}(t) + \tau_i^{-1} \cdot h_i^{(j)}$$

If the curvature is close to zero for some component, this can drive
$\eta^*$ to infinity. Thus, to avoid numerical instability (to bound
the condition number of the approximated Hessian), it is possible to
enforce a lower bound $\overline{h_i} \ge \epsilon$. This addition is
not necessary in our experiments, due to the presence of an
L2-regularization term.

4.4. Slow-start Initialization 

Figure 2. Illustration of the dynamics in a noisy quadratic bowl
(with 10 times larger curvature in one dimension than the other).
Trajectories of 400 steps from vSGD, and from SGD with three
different learning rate schedules. SGD with fixed learning rate
(crosses) descends until a certain depth (that depends on $\eta$)
and then oscillates. SGD with a $1/t$ cooling schedule (pink
circles) converges prematurely. On the other hand, vSGD (green
triangles) is much less disrupted by the noise and continually
approaches the optimum.

To initialize these estimates, we compute the arithmetic averages
over a handful ($n = 0.001 \times \#\text{traindata}$) of samples
before starting the main algorithm loop. We find that the algorithm
works best with a slow start heuristic, where the parameter updates
are kept small until the exponential averages become sufficiently
accurate. This is achieved by overestimating $\overline{v_i}$ and
$\overline{l}$ by a factor $C$. We find that setting $C = d/10$ as a
rule of thumb is both robust and near-optimal, because the value of
$C$ has only a transient initialization effect on the algorithm. The
appendix details how we arrived at this, and demonstrates the low
sensitivity empirically.
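A possible reading of the slow-start initialization is sketched below: averages are taken over $0.001 \times \#\text{traindata}$ samples, and the second-moment estimate is inflated by $C = d/10$. The function names are hypothetical, and applying $C$ once at initialization is an assumption about how the overestimate is realized (the appendix is the authoritative description). The sketch presumes $d \ge 10$, so that $C \ge 1$.

```python
def num_init_samples(n_train):
    """Number of warm-up samples: 0.001 * #traindata, at least 1."""
    return max(1, round(0.001 * n_train))


def slow_start_estimates(init_grads):
    """Initial g_bar and inflated v_bar from a list of gradient vectors."""
    n0, d = len(init_grads), len(init_grads[0])
    C = d / 10.0                                  # rule-of-thumb slow-start factor
    g_bar = [sum(g[i] for g in init_grads) / n0 for i in range(d)]
    v_bar = [C * sum(g[i] ** 2 for g in init_grads) / n0 for i in range(d)]
    return g_bar, v_bar


print(num_init_samples(60_000))   # warm-up size for a 60k-sample training set

g_bar, v_bar = slow_start_estimates([[1.0] * 20, [3.0] * 20])
print(g_bar[0], v_bar[0])
```

In this toy example the plain second moment would be 5, so the inflated value of 10 halves the initial ratio $\overline{g}^2 / \overline{v}$ and therefore the first learning rates.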

5. Adaptive Learning Rate SGD 

The simplest version of the method views each component in
isolation. This form of the algorithm will be called "vSGD" (for
"variance-based SGD"). In realistic settings with high-dimensional
parameter vectors, it is not clear a priori whether it is best to
have a single, global learning rate (that can be estimated
robustly), a set of local, dimension-specific rates, or
block-specific learning rates (whose estimation will be less
robust). We propose three variants on this spectrum:

vSGD-1 uses local gradient variance terms and the local diagonal
    Hessian estimates, leading to
    $\eta_i^* = (\overline{g_i})^2 / (\overline{h_i} \cdot \overline{v_i})$,

vSGD-g uses a global gradient variance term and an upper bound on
    diagonal Hessian terms:
    $\eta^* = \sum_i (\overline{g_i})^2 / (h^+ \cdot \overline{l})$,

vSGD-b operates like vSGD-g, but is only global across multiple
    (architecture-specific) blocks of parameters, with a different
    learning rate per block. Similar ideas are adopted in TONGA
    (Le Roux et al., 2008). In the experiments, the parameters
    connecting every two layers of the network are regarded as a
    block, with the corresponding bias parameters in separate blocks.

Figure 3. Optimizing a noisy quadratic loss (dimension $d = 1$,
curvature $h = 1$). Comparison between SGD for two different fixed
learning rates 1.0 and 0.2, and two cooling schedules $\eta = 1/t$
and $\eta = 0.2/t$, and vSGD (red circles). In dashed black, the
'oracle' computes the true optimal learning rate rather than
approximating it. In the top subplot, we show the median loss from
1000 simulated runs, and below are corresponding learning rates. We
observe that vSGD initially descends as fast as the SGD with the
largest fixed learning rate, but then quickly reduces the learning
rate, which dampens the oscillations and permits a continual
reduction in loss, beyond what any fixed learning rate could
achieve. The best cooling schedule ($\eta = 1/t$) outperforms vSGD,
but when the schedule is not well tuned ($\eta = 0.2/t$), the effect
on the loss is catastrophic, even though the produced learning rates
are very close to the oracle's (see the overlapping green crosses
and the dashed black line at the bottom).



The pseudocode for vSGD-1 is given in Algorithm 1; the other cases
are very similar. All of them have linear complexity in time and
space. In fact, the overhead of vSGD is roughly a factor of two,
which arises from the additional bbprop pass (which could be skipped
in all but a fraction of the updates). This cost is even less
critical because it can be trivially parallelized.
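For the global and block-wise variants, the rate computation can be sketched as below. The global form is $\sum_i (\overline{g_i})^2 / (h^+ \cdot \overline{l})$; the block-wise form is an assumption that applies the same expression inside each block, with a per-block curvature bound and a per-block squared-length estimate. All names are hypothetical.

```python
def vsgd_g_rate(g_bar, l_bar, h_max):
    """Global rate: squared norm of the mean gradient over (h+ * l_bar)."""
    return sum(g * g for g in g_bar) / (h_max * l_bar)


def vsgd_b_rates(g_bar, blocks, l_bar_blocks, h_max_blocks):
    """One rate per block; blocks is a list of parameter-index lists."""
    return [
        sum(g_bar[i] ** 2 for i in idx) / (h_max_blocks[b] * l_bar_blocks[b])
        for b, idx in enumerate(blocks)
    ]


print(vsgd_g_rate([3.0, 4.0], l_bar=50.0, h_max=2.0))

block_rates = vsgd_b_rates([3.0, 4.0, 1.0], blocks=[[0, 1], [2]],
                           l_bar_blocks=[50.0, 2.0], h_max_blocks=[2.0, 1.0])
print(block_rates)
```

Each block (for example one weight matrix or one bias vector) then shares a single rate, which sits between the fully global and the element-wise variants.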
Figure 4. Non-stationary loss. The loss is quadratic but now the target value ($\mu$) changes abruptly every 300 time-steps.
Above: loss as a function of time, below: corresponding learning rates. This illustrates the limitations of SGD with fixed or
decaying learning rates (full lines): any fixed learning rate limits the precision to which the optimum can be approximated
(progress stalls); any cooling schedule on the other hand cannot cope with the non-stationarity. In contrast, our adaptive
setting ('vSGD', red circles) closely resembles the optimal behavior (oracle, black dashes). The learning rate decays
like $1/t$ during the static part, but increases again after each abrupt change (with just a very small delay compared to
the oracle). The average loss across time is substantially better than for any SGD cooling schedule.



6. Experiments 

We test the new algorithm extensively on a couple of toy problems
first, and then follow up with results on well-known benchmark
problems for digit recognition, image classification and image
reconstruction, using the new SGD variants to train both convex
models (logistic regression) and non-convex ones (multi-layer
perceptrons).

6.1. Noisy Quadratic 

To form an intuitive understanding of the effects of 
the optimal adaptive learning rate method, and the 
effect of the approximation, we illustrate the oscilla- 
tory behavior of SGD, and compare the decrease in the 
loss function and the accompanying change in learning 
rates on the noisy quadratic loss function from Section 
3.1 (see Figure 2 and Figure 3), contrasting the effect 
of fixed rates or fixed schedules to adaptive learning 
rates, whether in approximation or using the oracle. 



6.2. Non-stationary Quadratic 

In realistic on-line learning scenarios, the curvature or noise
level in any given dimension changes over time (for example because
of the effects of updating other parameters), and thus the learning
rates need to decrease as well as increase. Of course, no fixed
learning rate or fixed cooling schedule can achieve this. To
illustrate this, we again use a noisy quadratic loss function, but
with abrupt changes of the optimum every 300 timesteps.

Figure 4 shows how vSGD with its adaptive memory-size appropriately
handles such cases. Its initially large learning rate allows it to
quickly approach the optimum, then it gradually reduces the learning
rate as the gradient variance increases relative to the squared norm
of the average gradient, thus allowing the parameters to closely
approach the optimum. When the data distribution changes (abruptly,
in our case), the algorithm automatically detects that the norm of
the average gradient increased relative to the variance. The
learning rate jumps back up and adapts to the new circumstances.
Note that here and in section 6.1 the curvature is always 1, which
implies that the preconditioning by the diagonal Hessian component
vanishes, and still the advantage of adaptive learning rates is
clear.

Figure 5. Training error versus test error on the three MNIST setups (after 6 epochs). Different symbol-color combinations
correspond to different algorithms, with the best-tuned parameter setting shown as a much larger symbol than the other
settings tried (the performance of Almeida is so bad it's off the charts). The axes are zoomed to the regions of interest
for clarity; for a more global perspective, and for the corresponding plots on the CIFAR benchmarks, see Figures 6 and 7.
Note that there was no tuning for our parameter-free vSGD, yet its performance is consistently good (see black circles).

Figure 6. Training error versus test error on the three CIFAR setups (after 6 epochs). Different symbol-color combinations
correspond to different algorithms, with the best-tuned parameter setting shown as a much larger symbol than the other
settings tried. The axes are zoomed to the regions of interest for clarity. Note how there is much more overfitting here
than for MNIST, even with vanilla SGD.

Figure 7. Training error versus test error on all 6 setups, global perspective. Different symbol-color combinations
correspond to different algorithms, with the best-tuned parameter setting shown as a much larger symbol than the other
settings tried.

Table 1. Experimental setup for standard datasets MNIST and the subset of CIFAR-10 using neural nets with
0 hidden layers (M0 and C0), 1 hidden layer (M1, C1 and CR), 2 hidden layers (M2). Columns 4 through 13 give the
best found hyper-parameters for SGD and the four adaptive algorithms used to compare vSGD to. Note that those
hyper-parameters vary substantially across the benchmark tasks.

6.3. Neural Network Training 

SGD is one of the most common training algorithms 
in use for (large-scale) neural network training. The 
experiments in this section compare the three vSGD 
variants introduced above with SGD, and some adap- 
tive algorithms described in section 2 (AdaGrad, 
Almeida, Amari and SMD), with additional details in 
the appendix. 

We exhaustively search for the best hyper-parameter settings among
$\eta_0 \in \{10^{-7}, 3 \cdot 10^{-7}, 10^{-6}, \ldots, 3 \cdot 10^{0}, 10^{1}\}$,
$\gamma \in \{0, 1/3, 1/2, 1\} / \#\text{traindata}$,
$\tau \in \{10^{5}, 5 \cdot 10^{4}, 10^{4}, 5 \cdot 10^{3}, 10^{3}, 10^{2}, 10^{1}\}$ and
$\mu \in \{10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}\}$, as determined by
their lowest test error (averaged over 2 runs), for each individual
benchmark setup. The last line in Table 3 shows the total number of
settings over which the tuning is done.
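The size of the search can be tallied with a short sketch. The grids follow the sets listed above; which grids apply to which algorithm is an assumption, chosen because it reproduces the setting counts reported for the tuned baselines (68, 17, 476, 119 and 119).

```python
from itertools import product

# eta_0: 1 and 3 times each power of ten from 1e-7 to 1e0, plus 1e1.
eta0_grid = [m * 10.0 ** e for e in range(-7, 1) for m in (1, 3)] + [10.0]
gamma_grid = [0.0, 1 / 3, 1 / 2, 1.0]        # each divided by #traindata in use
tau_grid = [1e5, 5e4, 1e4, 5e3, 1e3, 1e2, 1e1]
mu_grid = [1e-4, 1e-3, 1e-2, 1e-1]

# Assumed assignment of hyper-parameters to algorithms.
grids = {
    "SGD": (eta0_grid, gamma_grid),
    "AdaGrad": (eta0_grid,),
    "SMD": (eta0_grid, tau_grid, mu_grid),
    "Amari": (eta0_grid, tau_grid),
    "Almeida": (eta0_grid, tau_grid),
}
counts = {name: len(list(product(*axes))) for name, axes in grids.items()}
print(counts)
```

Every one of these settings has to be trained and evaluated per benchmark, whereas each vSGD variant is run with a single setting.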

6.3.1. Datasets 

We choose two widely used standard datasets to test the different
algorithms: the MNIST digit recognition dataset (LeCun & Cortes,
1998) (with 60k training samples, and 10k test samples), and the
CIFAR-10 small natural image dataset (Krizhevsky, 2009), namely the
'batch1' subset, which contains 10k training samples and 10k test
samples. We use CIFAR both to learn image classification and
reconstruction. The only form of preprocessing used (on both
datasets) is to normalize the data by subtracting the mean of the
training data along each input dimension.

Table 2. Final classification error (and reconstruction error for CIFAR-2R) on the training set, obtained after 6 epochs
of training, and averaged over ten random initializations. Variants are marked in bold if they don't differ statistically
significantly from the best one (p = 0.01). Note that the tuning parameters of SGD, AdaGrad, SMD, Amari and
Almeida are different for each benchmark (see Table 1). We observe the best results with the full element-wise learning
rate adaptation ('vSGD-1'), almost always significantly better than the best-tuned SGD or best-tuned AdaGrad.

Table 3. Final classification error (and reconstruction error for CIFAR-2R) on the test set, after 6 epochs of training,
averaged over ten random initializations. Variants are marked in bold if they don't differ statistically significantly from
the best one (p = 0.01). Note that the parameters of SGD, AdaGrad, SMD, Amari and Almeida were finely tuned, on
this same test set, and were found to be different for each benchmark (see Table 1); the last line gives the total number
of parameter settings over which the tuning was performed. Compared to training error, test set performance is more
balanced, with vSGD-1 being better or statistically equivalent to the best-tuned SGD in 4 out of 6 cases. The main
outlier (C0) is a case where the more aggressive element-wise learning rates led to overfitting (compare training error in
Table 2).

            vSGD-1   vSGD-b   vSGD-g   SGD      AdaGrad  SMD      Amari    Almeida
M0          7.50%    7.89%    8.20%    7.60%    7.52%    7.57%    7.69%    11.13%
M1          2.42%    2.44%    4.14%    2.34%    2.70%    2.37%    3.95%    8.39%
M2          2.16%    2.05%    3.65%    2.15%    2.34%    2.18%    2.97%    7.32%
C0          66.05%   61.70%   61.10%   61.06%   61.25%
C1          57.72%   59.55%   60.62%   58.85%   58.67%
CR          11.05    10.57    15.71    10.29    10.33
#settings   1        1        1        68       17       476      119      119

6.3.2. Network Architectures 

We use four different architectures of feed-forward neural
networks.

• The first one is simple softmax regression (i.e., a network with
  no hidden layer) for multi-class classification. It has a convex
  loss (cross-entropy) relative to the parameters. This setup is
  denoted 'M0' for the MNIST case, and 'C0' for the CIFAR
  classification case.

• The second one (denoted 'M1'/'C1') is a fully connected
  multi-layer perceptron, with a single hidden layer, with tanh
  non-linearities at the hidden units. The cross-entropy loss
  function is non-convex.

• The third (denoted 'M2', only used on MNIST) is a deep, fully
  connected multi-layer perceptron, with two hidden layers, again
  with tanh non-linearities.

• The fourth architecture is a simple autoencoder (denoted 'CR'),
  with one hidden layer (tanh non-linearity) and non-coupled
  reconstruction weights. This is trained to minimize the mean
  squared reconstruction error. Again, the loss is non-convex
  w.r.t. the parameters.

Formally, given input data $h_0 = x$, the network processes it
sequentially through $H$ hidden layers by applying an affine
transform followed by an element-wise tanh,

$$h_{k+1} = \tanh(W_k h_k + b_k), \qquad k = 0, \ldots, H - 1.$$

The output of the network $y = h_{H+1} = W_H h_H + b_H$ is then fed
into the loss function. For cross-entropy loss, the true label $c$
gives the target (delta) distribution to approximate, thus the loss
is

$$E\left[\mathrm{KL}(\delta_c \,\|\, p_y)\right] = E\left[-\log(p_y(c))\right],$$

where

$$p_y(c) = \frac{\exp(-y(c))}{\sum_k \exp(-y(k))}.$$
Figure 8. Learning curves for full-length runs of 100
epochs, using vSGD-1 on the M1 benchmark with 800
hidden units. Test error is shown in red, training error is
green. Note the logarithmic scale of the horizontal axis.
The average test error after 100 epochs is 1.87%.

For mean-squared reconstruction error, the loss is

$$E\left[\tfrac{1}{2} \|x - y\|_2^2\right] \qquad (9)$$
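The forward pass and the reconstruction loss can be sketched in plain Python. The sketch assumes the layer recursion $h_{k+1} = \tanh(W_k h_k + b_k)$ for the hidden layers, an affine output layer, and a factor $\frac{1}{2}$ in the squared error; weights are stored as lists of rows, and all names are hypothetical.

```python
import math


def forward(x, weights, biases):
    """Feed-forward pass: tanh on every layer except the last (affine) one."""
    h = x
    last = len(weights) - 1
    for k, (W, b) in enumerate(zip(weights, biases)):
        z = [sum(w * a for w, a in zip(row, h)) + bk for row, bk in zip(W, b)]
        h = z if k == last else [math.tanh(v) for v in z]
    return h


def reconstruction_loss(x, y):
    """Half the squared Euclidean distance between input and reconstruction."""
    return 0.5 * sum((a - b) ** 2 for a, b in zip(x, y))


# No hidden layer (H = 0): the network is a single affine map.
identity = [[1.0, 0.0], [0.0, 1.0]]
print(forward([1.0, 2.0], [identity], [[0.0, 0.0]]))

# One hidden layer with zero weights: tanh(0) = 0, so the output equals
# the output bias.
zeros = [[0.0, 0.0], [0.0, 0.0]]
print(forward([1.0, 2.0], [zeros, identity], [[0.0, 0.0], [0.5, -0.5]]))

print(reconstruction_loss([1.0, 2.0], [1.0, 0.0]))
```

With no hidden layer the loop applies only the affine output map, which corresponds to the softmax-regression setups; adding entries to `weights` and `biases` gives the one- and two-hidden-layer networks.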

The exact numbers of hidden units in each layer, and the
corresponding total problem dimensions are given in Table 1. The
parameters are initialized randomly based on Glorot & Bengio.

To avoid over-fitting, especially for CIFAR which has a
comparatively small dataset, we add $\frac{\lambda}{2} \|w\|_2^2$,
an $L_2$ regularization term on the weights, to the loss in all
experiments (with $\lambda = 10^{-4}$). This also avoids numerical
instability in vSGD-1, because the estimated diagonal Hessian
elements will almost never be close to zero.

6.3.3. Results 

For each benchmark, ten independent runs are averaged and reported
in Table 2 (training set) and Table 3 (test set). They show that the
best vSGD variant, across the board, is vSGD-1, which most
aggressively adapts one learning rate per dimension. It is almost
always significantly better than the best-tuned SGD or best-tuned
AdaGrad on the training set, and better than or statistically
equivalent to the best-tuned SGD in 4 out of 6 cases on the test
set. The main outlier (C0) is a case where the more aggressive
element-wise learning rates led to overfitting (compare training
error in Table 2), probably because of the comparatively small
dataset size. Figure 5 illustrates the sensitivity to
hyper-parameters of SGD, AdaGrad, SMD and Amari's natural gradient
on the three MNIST benchmarks: different settings scatter across the
performance scale and tuning matters. This is in stark contrast with
vSGD, which without tuning obtains the same performance as the
best-tuned algorithms. Figure 6 does the same for the three CIFAR
benchmarks, and Figure 7 provides a more global perspective (zoomed
out from the region of interest).

Figure 9 shows the evolution of (minimal/maximal) 
learning rates over time, emphasizing the effects of 
slow-start initialization in our approach, and Figure 8 
shows the learning curve over 100 epochs, much longer 
than the remainder of the experiments. 

7. Conclusions 

Starting from the idealized case of quadratic loss con- 
tributions from each sample, we derived a method to 
compute an optimal learning rate at each update, and 
(possibly) for each parameter, that optimizes the ex- 
pected loss after the next update. The method relies 
on the square norm of the expectation of the gradi- 
ent, and the expectation of the square norm of the 
gradient. We showed different ways of approximat- 
ing those learning rates in linear time and space in 
practice. The experimental results confirm the theo- 
retical prediction: the adaptive learning rate method 
completely eliminates the need for manual tuning of 
the learning rate, or for systematic search of its best 
value. 
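
As a minimal illustration of the two quantities named above, consider a one-dimensional noisy quadratic model. The sketch assumes a per-sample loss of the form (1/2) h (theta - x)^2 with x drawn from N(theta*, sigma^2), so that each sample gradient is h (theta - x). All function names are hypothetical, and plain sample means stand in for the running averages that a practical implementation would use. Under these assumptions the ratio (mean gradient)^2 / (h * mean squared gradient) should approach the oracle rate (theta - theta*)^2 / (h ((theta - theta*)^2 + sigma^2)).

```python
import numpy as np


def sample_gradients(theta, theta_star, sigma, h, n, rng):
    """Draw n per-sample gradients h * (theta - x), with x ~ N(theta_star, sigma^2).

    Assumption: per-sample loss is 0.5 * h * (theta - x)^2.
    """
    x = theta_star + sigma * rng.standard_normal(n)
    return h * (theta - x)


def estimated_rate(grads, h):
    """Rate estimate from samples: (mean gradient)^2 / (h * mean squared gradient)."""
    g_bar = np.mean(grads)        # estimate of the expected gradient
    v_bar = np.mean(grads ** 2)   # estimate of the expected squared gradient
    return g_bar ** 2 / (h * v_bar)


def oracle_rate(theta, theta_star, sigma, h):
    """Rate computed from the true (normally unknown) model parameters."""
    sq_dist = (theta - theta_star) ** 2
    return sq_dist / (h * (sq_dist + sigma ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    grads = sample_gradients(theta=2.0, theta_star=0.0, sigma=1.0, h=3.0,
                             n=200_000, rng=rng)
    print("estimated:", estimated_rate(grads, h=3.0))
    print("oracle:   ", oracle_rate(2.0, 0.0, 1.0, 3.0))
```

The estimate only needs gradient samples and the curvature h, which is why the rate can be tracked on-line without knowing theta* or sigma.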

Our adaptive approach makes SGD more robust in two 
related ways: (a) When used in on-line training sce- 
narios with non-stationary signals, the adaptive learn- 
ing rate automatically increases when the distribution 
changes, so as to adjust the model to the new distri- 
bution, and automatically decreases in stable periods 
when the system fine-tunes itself within an attractor. 
This provides robustness to dynamic changes of the 
optimization landscape. (b) The tuning-free property 
implies that the same algorithm can adapt to drasti- 
cally different circumstances, which can appear within 
a single (deep or heterogeneous) network. This ro- 
bustness alleviates the need for careful normalizations 
of inputs and structural components. 

Given the successful validation on a variety of classical 
large-scale learning problems, we hope that this 
enables SGD to be a truly user-friendly 'out-of-the-box' 
method. 

Figure 9. Evolution of learning rates. It shows how the learning rates (minimum and maximum across all dimensions) 
vary as a function of the epoch. Left: CIFAR classification (no hidden layer), right: MNIST classification (no hidden 
layer). Each symbol/color corresponds to the median behavior of one algorithm. The range of learning rates (for those 
algorithms that don't have a single global learning rate) is shown in a colored band in-between the min/max markers. 
The log-log plot highlights the initial behavior, namely the 'slow start' (until about 0.1 epochs) due to a large C constant 
in our methods, which contrasts with the quick start of AdaGrad. We also note that AdaGrad (yellow circles) has 
drastically different ranges of learning rates on the two benchmarks. 



A. Convergence Proof 

If we do gradient descent with $\eta^*(t)$, then almost 
surely, the algorithm converges (for the quadratic 
model). To prove that, we follow classical techniques 
based on Lyapunov stability theory (Bucy, 1965). No- 
tice that the expected loss follows 



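$$
\begin{aligned}
\mathbb{E}\left[J(\theta^{(t+1)}) \mid \theta^{(t)}\right]
&= \tfrac{1}{2} h \cdot \mathbb{E}\left[\left((1 - \eta^* h)(\theta^{(t)} - \theta^*) + \eta^* h \sigma \xi\right)^2 + \sigma^2\right] \\
&= \tfrac{1}{2} h \left[\frac{\sigma^2}{(\theta^{(t)} - \theta^*)^2 + \sigma^2}\,(\theta^{(t)} - \theta^*)^2 + \sigma^2\right] \\
&\le \tfrac{1}{2} h \left[(\theta^{(t)} - \theta^*)^2 + \sigma^2\right] = J(\theta^{(t)}).
\end{aligned}
$$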



Thus $J(\theta^{(t)})$ is a positive super-martingale, indicating 
that almost surely $J(\theta^{(t)}) \to J^\infty$. We are to prove 
that almost surely $J^\infty = J(\theta^*) = \frac{1}{2} h \sigma^2$. Observe that 

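$$J(\theta^{(t)}) - \mathbb{E}\left[J(\theta^{(t+1)}) \mid \theta^{(t)}\right] = \tfrac{1}{2} h\, \eta^*(t),$$

$$\mathbb{E}\left[J(\theta^{(t)})\right] - \mathbb{E}\left[J(\theta^{(t+1)}) \mid \theta^{(t)}\right] = \tfrac{1}{2} h\, \mathbb{E}\left[\eta^*(t)\right].$$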

Since $\mathbb{E}[J(\theta^{(t)})]$ is bounded below by 0, the telescoping 
sum gives us $\mathbb{E}[\eta^*(t)] \to 0$, which in turn implies that 
in probability $\eta^*(t) \to 0$. We can rewrite this as 

$$\frac{J(\theta^{(t)}) - \frac{1}{2} h \sigma^2}{J(\theta^{(t)})} \to 0.$$

By uniqueness of the limit, almost surely, 
$(J^\infty - \frac{1}{2} h \sigma^2) / J^\infty = 0$. Given that $J$ is strictly 
positive everywhere, we conclude that $J^\infty = \frac{1}{2} h \sigma^2$ 
almost surely, i.e. $J(\theta^{(t)}) \to \frac{1}{2} h \sigma^2 = J(\theta^*)$. 

Figure 10. Critical values for initialization parameter C. 
This plot shows the values of C below which vSGD-1 becomes 
unstable (too large initial steps). We determine the 
critical C value as the largest for which at least 10% of 
the runs give rise to instability. The markers correspond 
to experiments with setups on a broad range of parameter 
dimensions. Six markers correspond to the benchmark setups 
from the main paper, and the green stars correspond 
to the simple XOR-classification task with an MLP of a 
single hidden layer, the size of which is varied from 2 to 
500000 neurons. The black dotted diagonal line indicates 
our 'safe' heuristic choice of C = d/10. 
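
The convergence argument of Appendix A can be illustrated with a small simulation. The sketch below is not part of the proof. It assumes the one-dimensional model in which each sample is x = theta* + sigma * xi with standard normal xi, the per-sample gradient is h (theta - x), and the oracle rate is eta* = (theta - theta*)^2 / (h ((theta - theta*)^2 + sigma^2)). The function name and default values are hypothetical. The run-averaged expected loss should stay above the floor (1/2) h sigma^2 and approach it over time.

```python
import numpy as np


def simulate_optimal_rate(h=2.0, sigma=1.0, theta_star=0.5, theta0=10.0,
                          steps=2000, runs=500, seed=0):
    """Run SGD with the oracle rate on the 1-D noisy quadratic model.

    Returns an array of length `steps` holding J(theta_t), averaged over
    `runs` independent trajectories, where
    J(theta) = 0.5 * h * ((theta - theta_star)^2 + sigma^2).
    """
    rng = np.random.default_rng(seed)
    theta = np.full(runs, theta0, dtype=float)
    losses = np.empty(steps)
    for t in range(steps):
        sq_dist = (theta - theta_star) ** 2
        losses[t] = np.mean(0.5 * h * (sq_dist + sigma ** 2))
        # Oracle rate: uses theta_star and sigma, which are unknown in practice.
        eta = sq_dist / (h * (sq_dist + sigma ** 2))
        x = theta_star + sigma * rng.standard_normal(runs)
        theta = theta - eta * h * (theta - x)   # SGD step on the sample x
    return losses


if __name__ == "__main__":
    losses = simulate_optimal_rate()
    floor = 0.5 * 2.0 * 1.0 ** 2   # (1/2) h sigma^2 for the default values
    print("initial loss:", losses[0])
    print("final loss:  ", losses[-1])
    print("floor:       ", floor)
```

Because the oracle rate shrinks as theta approaches theta*, the noise injected by each step also shrinks, which is what lets the loss settle at the floor rather than fluctuate around it.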

B. Derivation of Global Learning Rate 

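We can derive an optimal global learning rate $\eta_g^*$ as 
follows. 

$$
\begin{aligned}
\eta_g^*(t) &= \arg\min_{\eta}\; \mathbb{E}\left[J(\theta^{(t+1)}) \mid \theta^{(t)}\right] \\
&= \arg\min_{\eta}\; \sum_{i=1}^{d} h_i \left((1 - \eta h_i)^2 (\theta_i^{(t)} - \theta_i^*)^2 + \sigma_i^2 + \eta^2 h_i^2 \sigma_i^2\right) \\
&= \arg\min_{\eta}\; \left[\eta^2 \sum_{i=1}^{d} \left(h_i^3 (\theta_i^{(t)} - \theta_i^*)^2 + h_i^3 \sigma_i^2\right) - 2\eta \sum_{i=1}^{d} h_i^2 (\theta_i^{(t)} - \theta_i^*)^2\right]
\end{aligned}
$$

which gives 

$$
\eta_g^*(t) = \frac{\sum_{i=1}^{d} h_i^2 (\theta_i^{(t)} - \theta_i^*)^2}{\sum_{i=1}^{d} \left(h_i^3 (\theta_i^{(t)} - \theta_i^*)^2 + h_i^3 \sigma_i^2\right)}
$$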



The adaptive time-constant for the global case is: 



$$\tau_g(t+1) = \ldots$$
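
The closed form for the global rate can be sanity-checked numerically. The sketch below assumes the separable quadratic objective of this appendix, sum_i h_i ((1 - eta h_i)^2 delta_i^2 + sigma_i^2 + eta^2 h_i^2 sigma_i^2) with delta_i = theta_i - theta_i*, and compares the ratio sum_i h_i^2 delta_i^2 / sum_i h_i^3 (delta_i^2 + sigma_i^2) against a brute-force grid search. The function names and the example values are hypothetical.

```python
import numpy as np


def expected_next_loss(eta, h, delta, sigma):
    """Objective minimized over eta (constant factors dropped):
    sum_i h_i * ((1 - eta h_i)^2 delta_i^2 + sigma_i^2 + eta^2 h_i^2 sigma_i^2).
    """
    return np.sum(h * ((1.0 - eta * h) ** 2 * delta ** 2
                       + sigma ** 2
                       + eta ** 2 * h ** 2 * sigma ** 2))


def optimal_global_rate(h, delta, sigma):
    """Closed-form minimizer of expected_next_loss with respect to eta."""
    numerator = np.sum(h ** 2 * delta ** 2)
    denominator = np.sum(h ** 3 * (delta ** 2 + sigma ** 2))
    return numerator / denominator


def grid_search_rate(h, delta, sigma, lo=0.0, hi=1.0, points=10001):
    """Brute-force minimizer on a uniform grid, for comparison only."""
    grid = np.linspace(lo, hi, points)
    values = [expected_next_loss(eta, h, delta, sigma) for eta in grid]
    return grid[int(np.argmin(values))]


if __name__ == "__main__":
    h = np.array([0.5, 1.0, 4.0])        # per-dimension curvatures
    delta = np.array([1.0, -2.0, 0.3])   # theta_i - theta_i*
    sigma = np.array([0.1, 0.5, 1.0])    # per-dimension noise levels
    print("closed form:", optimal_global_rate(h, delta, sigma))
    print("grid search:", grid_search_rate(h, delta, sigma))
```

Note how the dimension with the largest curvature dominates the denominator through h_i^3, so a single global rate is held back by the steepest direction.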



C. SMD Implementation 

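The details of our implementation of SMD (based on 
a global learning rate) are given by the following 
updates: 

$$
\begin{aligned}
\theta_{t+1} &= \theta_t - \eta_t \nabla_\theta \\
\eta_{t+1} &= \eta_t \exp\left(-\mu \nabla_\theta^{\top} v_t\right) \\
v_{t+1} &= (1 - \tau^{-1})\, v_t - \eta_t \left(\nabla_\theta + (1 - \tau^{-1}) \cdot H_t v_t\right)
\end{aligned}
$$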



where Hv denotes the Hessian-vector product with 
vector v, which can be computed in linear time. The 
three hyper-parameters used are the initial learning 
rate $\eta_0$, the meta-learning rate $\mu$, and a time constant 
$\tau$ for updating the auxiliary vector $v$. 
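
The three SMD updates can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the implementation used in the experiments: all right-hand sides use the quantities from step t, the test problem is a noise-free diagonal quadratic (so the Hessian-vector product is simply h * v), and the function names and default hyper-parameter values are hypothetical.

```python
import numpy as np


def smd_step(theta, eta, v, grad, hess_vec, mu, tau):
    """One SMD update with a single global learning rate.

    grad is the gradient at theta, hess_vec is the Hessian-vector
    product H v at theta. All right-hand sides use step-t values.
    """
    decay = 1.0 - 1.0 / tau
    theta_next = theta - eta * grad
    eta_next = eta * np.exp(-mu * np.dot(grad, v))
    v_next = decay * v - eta * (grad + decay * hess_vec)
    return theta_next, eta_next, v_next


def quadratic_loss(theta, h, theta_star):
    """Diagonal quadratic test loss: 0.5 * sum_i h_i (theta_i - theta_star_i)^2."""
    return 0.5 * np.sum(h * (theta - theta_star) ** 2)


def run_smd_on_quadratic(h, theta0, theta_star, eta0=0.1, mu=0.01,
                         tau=10.0, steps=200):
    """Run SMD on the noise-free diagonal quadratic; returns (theta, eta)."""
    theta = np.asarray(theta0, dtype=float)
    eta = eta0
    v = np.zeros_like(theta)          # auxiliary vector starts at zero
    for _ in range(steps):
        grad = h * (theta - theta_star)
        hess_vec = h * v              # H v for a diagonal Hessian
        theta, eta, v = smd_step(theta, eta, v, grad, hess_vec, mu, tau)
    return theta, eta


if __name__ == "__main__":
    h = np.array([1.0, 4.0])
    theta_star = np.zeros(2)
    theta, eta = run_smd_on_quadratic(h, [1.0, 1.0], theta_star)
    print("final loss:", quadratic_loss(theta, h, theta_star))
    print("final learning rate:", eta)
```

For a general model the product H v would come from a linear-time Hessian-vector routine instead of the explicit diagonal used here.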

D. Sensitivity to Initialization 

Figure 11 shows that the initialization parameter C 
does not affect performance, so long as it is sufficiently 
large. This is not surprising, because its only effect 
is to slow down the initial step sizes until accurate 
exponential averages of the interesting quantities can 
be computed. 

There is a critical minimum value of C, below which 
the algorithm is unstable. Figure 10 shows what those 
critical values are for 13 different setups with widely 
varying problem dimension. From these empirical re- 
sults, we derive our rule-of-thumb choice of C = d/10 
as a 'safe' pick for the constant (in fact it is even a fac- 
tor 10 larger than the observed critical value for any 
of the benchmarks, just to be extra careful). 